input or query file to a set of files to detect similarities 
between the query file and the set of files, and digitally 
shredding files that match, to some degree, the query file and 
doing so from within the comparison feature. Using a 
comparison program, the query file is compared with each 
non-query file in a data processing system, ranging from a 
stand-alone computer to an enterprise computing network. A 
list of non-query files having some degree of similarity with 
the query file is compiled and presented to the user via a user 
interface within the comparison program. Certain or all 
non-query files can then be deleted by marking the names of 
those non-query files in the list. The comparison program 
can be of the type using either clustering or coalescing, or 
both, known hashing techniques, or other comparison 
algorithms. 
METHOD AND APPARATUS FOR DIGITALLY SHREDDING SIMILAR
DOCUMENTS WITHIN LARGE DOCUMENT SETS IN A DATA PROCESSING
ENVIRONMENT

CROSS REFERENCE TO RELATED APPLICATION

This is a Continuation-in-part application of copending
prior application Ser. No. 09/127,105 filed on Jul. 31, 1998,
now U.S. Pat. No. 6,240,409 issued May 29, 2001, the
disclosure of which is incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to computer
applications and programming. More specifically, it relates to
utility programs used to detect similarities and differences
among multiple documents of the same or different type.

2. Discussion of Related Art

A common feature or utility in some word processing
programs and operating systems is the ability to compare
files and provide information on differences (or similarities)
between the files. There are a variety of file comparison
programs available which have different limitations and
capabilities, for example, with regard to how and what
comparison data is presented or the number of files that can
be compared in one run. Many of these programs are
adequate in certain aspects but have drawbacks in others,
making them poorly suited for certain applications. This is
particularly true given the constantly growing trend to store,
submit, transfer, copy, and otherwise manipulate information
electronically.

One utility used to compare files in the UNIX operating
system is known as diff. This program can compare up to
three files or documents. The output of this program is
typically two columns of data. One column displays line
numbers in one (subject) document across from a second
column displaying line numbers in the query document that
are different from corresponding line numbers in the subject
document. Thus, the diff utility is used when the documents
are assumed to be generally similar. The program uses a
dynamic programming algorithm that computes the minimal
"edit distance" between two documents. An "edit distance"
between two documents, or strings, is the length of a
minimal sequence of insertions, deletions, and substitutions
that transforms one to the other. From information about
how the minimal edit distance is derived, diff computes
matching passages in the two documents, which are
presented to the user in the column format described earlier.
The program cannot find differences among sets or large
bodies of documents, but typically between two or among
three documents at most.

Other methods of comparing files can be broadly
categorized as information retrieval methods. These methods
compare statistical profiles of documents. For example, one
strategy used by these methods is computing a histogram of
word frequencies for each document, or a histogram of the
frequency of certain pairs or juxtaposition of words in a
document. Documents with similar histograms are considered
to be similar documents. Refinements of these methods
include document preprocessing (e.g. removing unimportant
words) prior to computing the statistical profile and applying
the same information retrieval method to subsections of
documents. Some of the primary drawbacks of these methods
include tendencies to provide false positive matches and
presenting output or results in a form difficult to quickly
evaluate. False positives arise because it is sometimes
difficult to prevent dissimilar documents from having similar
statistical profiles. With respect to presentation, these
methods often simply provide correlations. In sum, these
methods can often provide too little information about
similarities or differences among documents, thus requiring
the user to closely evaluate the results and refer back to the
files being compared to determine whether meaningful
differences or similarities exist.

Another method is based on a procedure known as
document fingerprinting. Fingerprinting a document is
computing hashes of selected substrings in the document. A
particular set of substring hashes chosen to represent
a document is the document's fingerprint. The similarity of
two documents is defined as a ratio C/T where C is the
number of hashes the two documents have in common and
T is the total number of hashes taken of one of the
documents. Assuming a well-behaved hash function, this ratio
is a good estimate of the actual percentage overlap between
the two documents. However, this also assumes that a
sufficient number of substring hashes are used. Various
approaches have been used in determining which substrings in
a document are selected for hashing and which of these
substring hashes are saved as part of the document
fingerprint. One way is to compute hashes of all substrings of
a fixed length k and retain those hashes that are 0 mod p for
some integer p. Another way is partitioning the document
into substrings with hashes that are 0 mod p and saving those
hashes. The difference from the first way is that the
substrings selected are not of fixed length. In this method, a
character is added to a substring until the hash of the
substring is 0 mod p, at which point the next substring is
formed. In order to reduce memory requirements, the
program can set p to 15 or 20, thereby saving, in theory, every
15th or 20th hash value. However, based on probability
theory, for a large body of documents, there will be large gaps
where no hash value will be saved. This can potentially lead
to the situation where an entire document is bypassed without
having a single substring hash value saved for a fingerprint.
More generally, if gaps between stored hash values are too
long, a document's fingerprint will be faint or thin and, thus,
ill-suited for comparison to other documents.

Another related feature useful to many types of
organizations is the ability to purge or delete documents and
files containing redundant material. A feature of this type is
useful for a variety of reasons, such as making better use of
memory by deleting multiple copies of the same document
or keeping better track of multiple versions of the same
document within an organization. Importantly, many
organizations temporarily use proprietary documents. When
the time comes to delete the proprietary material, it is
important to locate and delete all documents that may include
fragments of the original proprietary documents. Comparison
functions described above, as well as others, generally do
not include the additional feature allowing a user to delete or
"shred" documents or passages that match a query document.
Further, present comparison programs are largely
inadequate for properly identifying the full complement of
documents in a corpus that may include significant
overlapping content with a proprietary query document (e.g.
an original document). Note also that in current approaches, a
user has to exit a comparison program, after manually or
mentally noting which documents are to be deleted, and use
typical operating system commands to delete the documents.
In other words, the deletion process is separated from the
comparison function, thereby increasing the possibility of
deleting the wrong documents and making the process
further time-consuming. A document shredding component
inherent in a comparison program would allow a user to
delete documents efficiently and with the reduced possibility
of committing errors in deleting wrong documents or
leaving out documents meant to be deleted.

Therefore, it would be desirable to determine similarities
among large sets of documents in a manner that guarantees
that if a substring of a predefined length in one of the
documents appears in another document, it will be detected,
and thereby not rely on probability for measuring
comparison accuracy. In addition, it would be desirable to
present comparison results in a meaningful and easily
comprehensible format to users, thereby enabling quick
evaluation of document similarities. It would also be desirable
to be able to delete or otherwise manipulate documents
similar to a query document without having to exit a document
matching program, thereby enhancing a document comparison
feature. It would be desirable to give a user the option to be
presented with a user interface that facilitates the deletion of
documents having a certain percentage of similarity with
one or more query documents.

SUMMARY OF THE INVENTION

To achieve the foregoing, and in accordance with the
purpose of the present invention, methods, apparatus, and
computer program products for comparing an input or query
file to a set of files to detect similarities and formatting the
output comparison data are described. In one aspect of the
present invention, a method of comparing files and
formatting output data involves receiving an input query file
that can be segmented into multiple query file substrings. A
query file substring is selected and used to search an index
file containing multiple ordered file substrings that were
taken from previously analyzed files. If the selected query
file substring matches any of the multiple ordered file
substrings, match data relating to the match between the
selected query file substring and the matching ordered file
substring is stored in a temporary file. The matching ordered
file substring and another ordered file substring are joined if
the matching ordered file substring and the other ordered file
substring are in a particular sequence and if the selected
query file substring and a second query file substring are in
the same particular sequence. If the matching ordered file
substring and the second query file substring match, a
coalesced matching ordered substring and a coalesced query
file substring are formed that can be used to format output
comparison data.

In another aspect of the present invention, a method of
comparing two strings in a data processing system, where
the strings can represent various types of documents or files,
is described. Substrings common to the strings are identified.
A subset of substrings, from within the common substrings,
which occur in the same relative positions in the two strings
are identified. Substrings which are present in the same
relative positions in the two strings are then stored as a group
or displayed as a group.

In another aspect of the present invention, a method of
segmenting a file, representable as a string of characters, as
one step in a file matching program is described. Multiple
substrings or segments from the string of characters having
a predetermined length and a beginning position are created.
A predetermined offset or gap between the beginning
positions of each consecutive segment is maintained. A file
matching program using the multiple segments and the
predetermined offset is executed. The program is able to
detect a similar passage between two or more files where the
passage has a length of at least the sum of the predetermined
length and the predetermined offset.

In another aspect of the present invention, a method of
comparing a first string and a second string is described. The
first string is divided into multiple substrings of length l and
offset or gap g between two substrings, where g is at least
two characters long. A substring of length l is selected from
the second string. It is then determined whether the substring
of length l from the second string matches any of the
multiple substrings from the first string. If the substring from
the second string matches any substring from the first string,
the substring from the second string is saved, at least
temporarily. Finally, it is indicated that the substring from
the second string matches a particular substring from the
first string.

In another aspect of the present invention, a method of
digitally shredding documents based on the documents'
similarity with one or more query documents and doing so
from within a document comparison program is described.
A first string, representing a query document, is compared
with a group of second strings representing a corpus of
non-query documents. A list of second string names taken
from the group of strings is compiled. Each second
string corresponds to a name from the list of second string
names and matches the first string to a degree or percentage
greater than a particular threshold degree or percentage.
Second string names corresponding to second strings (i.e.,
non-query documents) are deleted from the list of names,
thereby eliminating copies of the first string (i.e., query
documents) and remnants, such as partial copies or
derivatives of the first string, from the data processing system.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, together with further advantages thereof,
may best be understood by reference to the following
description taken in conjunction with the accompanying
drawings in which:

FIGS. 1a-b are a flowchart showing a method of hashing,
comparing, and storing a query document against documents
already stored in an index file in accordance with one
embodiment of the present invention.

FIG. 2 is a block diagram of an index file and of records
contained in the index file in accordance with one
embodiment of the present invention.

FIG. 3 is a diagram showing a transformation of a raw
data string to a series of substrings using l and g in
accordance with one embodiment of the present invention.

FIG. 4a is a flowchart showing in greater detail step 126
of FIG. 1b in which a current document is clustered based
on matches with documents previously loaded into the index
file.

FIG. 4b is an illustration of a format of a match list in
accordance with one embodiment of the present invention.

FIG. 4c is an illustration of a data structure showing how
documents can be clustered in accordance with one
embodiment of the present invention.

FIG. 5 is a flowchart showing in greater detail step 130 of
FIG. 1b of coalescing matching data segments into passages
and presenting output to users in accordance with one
embodiment of the present invention.

FIG. 6 is a flowchart describing a process of comparing a
query document or string against a corpus of documents and
deleting corpus documents matching the query document in
accordance with one embodiment of the present invention.
FIG. 7 is a block diagram of a typical computer system
suitable for implementing an embodiment of the present
invention.

DETAILED DESCRIPTION

Reference will now be made in detail to a preferred
embodiment of the invention. An example of the preferred
embodiment is illustrated in the accompanying drawings.
While the invention will be described in conjunction with a
preferred embodiment, it will be understood that it is not
intended to limit the invention to one preferred embodiment.
To the contrary, it is intended to cover alternatives,
modifications, and equivalents as may be included within
the spirit and scope of the invention as defined by the
appended claims.

FIG. 1 is a flowchart showing a method of querying and
loading a document into an index file in accordance with one
embodiment of the present invention. At a step 102 a corpus
or collection of documents that are to be compared against
each other is gathered. A document can be any logical entity
such as a set of files comprising one program or multiple
sections (e.g. attachments) in an e-mail message. The
documents in the collection can be of the same type or have
different types. For example, each file is a computer program
in a particular language or a database file organized
according to a particular database program. At a step 104 the
first or next document in the collection is selected for
comparison against the documents already loaded. If it is the
first document, the index file (described below) containing
hash and position values is empty. In either case, in the
described embodiment, a position corresponding to the
beginning of the selected document is stored in a B-tree or
similar structure. As described in FIG. 2, a page or block in
the index file can be expanded or appended with additional
pages if a current page becomes full.

At a step 106 the document is translated or preprocessed
from its original (e.g. human readable) format to a format
suitable for segmenting and hashing. For the purposes of
illustrating the described embodiment, a document is
referred to as a string, such as a string of alphanumeric
characters. A sample string can be a string of characters
comprising a sentence or a line of computer code. In the
described embodiment, the string is translated to a token
string that represents and preserves the structure and content
of the original or raw data string. Each string (i.e. document)
is translated according to its document type. Translation
rules are tailored specifically for the type or types of
documents being translated, such as the syntax and
semantics of a particular programming language or common
or frequent words expected in documents from a particular
source.

An example of a raw data string and a translated version
of the same string is shown in FIG. 3. In that example, an
English sentence is translated by having punctuation, white
spaces, and capitalization removed. Further processing can
include removing unimportant words such as "the" or "and."
In another example using a computer programming
language, a string containing computer instructions having
real variable names and operators is translated to a token
string. Thus, in the described embodiment, the string: if
sales_revenue>operating_costs then projections=TRUE,
can be translated to the token string: if <var> op <var> then
<var> op true. In addition, in the described embodiment, the
token string includes position data indicating the position of
the tokens in the original document. This position data is
used later in presenting the comparison data to a user. Thus,
the result of translating or preprocessing is a series of token
and position pairs <T,P> that are suitable for segmenting and
hashing. In other preferred embodiments, the abstraction of
the raw data string to a preprocessed data string can be
performed using a variety of abstraction operations or
translation sets that reduce the amount of data in the raw data
string. This abstraction typically makes subsequent
processing far more efficient.

At a step 108 the next or first substring of length l is
selected and a position marker is incremented by one to
indicate the beginning of the selected substring. An
appropriate length l can be determined empirically based on
the type of documents that are being queried and loaded into
the index file. Substring length l is the length of a substring
within a translated string that is hashed and stored in the
index file. The number of substrings within the translated
string that is hashed is determined by an offset or gap g
discussed in greater detail below and in FIG. 3. These
values, specifically l, can be chosen based on experience
working with the type of documents being hashed and
loaded, or can be determined based on an intuitive or natural
feeling of how many characters or words one has to see to
suspect that some copying has occurred between two
documents. However, in the described embodiment substring
length l and offset g are each constant for all documents of
a particular type that are compared against one another.
Normal text documents may have a substring length in the
30 to 40 character range. For computer programs, l may be
in the 40 to 50 character range depending on the
programming language. For executable or binary files, l can be
in the range of several hundred characters.

At a step 110 a hash function is applied to the selected
substring within the translated string, or document. The hash
function creates a hash value of a fixed length m. The hash
value is stored in the index file which, in the described
embodiment, is an extensible hash table made up of a linked
list of hash pages described in greater detail in FIG. 2. One
purpose of using a hash function is to maintain a random
distribution of hash values in the index file. Any
well-behaved hash function can be used in the described
embodiment. One criterion for a well-behaved hash function
is not returning the same hash value for two different
substrings. An example of a hash function includes taking the
product of a numeric representation of a character and a prime
number. Each character in a substring must be part of an
alphabet, such as the ASCII character set. Each member of
this character set has an associated unique prime number.
Another prime number, p, larger than any prime number
corresponding to the character set is chosen. This number is
raised to a certain power and multiplied by the prime number
corresponding to a character. These products are then
summed. For example, if the substring contains the prime
numbers 7,3,9, the first part of the hash function would be
the calculation 7p^3 + 3p^2 + 9p. The final hash value is the
modulus of this sum by 2^32, which is the word length of the
computer. This number can vary depending on the type of
computer being used. In other preferred embodiments, hash
functions using other formulas and calculations can be used.

At a step 112 the program queries the index file for the
hash value calculated at step 110. The index file will not
contain any hash values to query against if the document is
the first document in the collection. However, it is possible
to have a substring occur more than once in a single
document, in which case the index file may contain a hash
value to query against. As described in FIG. 2, the first n bits
of the calculated hash value are used to identify a hash page
in the index file. Thus, the first n bits of the current hash
value are used to identify a certain hash page in the index file
and that page is searched for the remaining m-n bits in the
current hash value. In the described embodiment a hash page
can have overflow pages associated with it that may also
need to be searched.
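
Steps 110 and 112 can be illustrated with a minimal Python sketch. It is only one possible reading of the example hash function above: the character-to-prime table `CHAR_PRIME`, the choice `P = 1009`, the hash width `M_BITS = 32`, and the helper names are assumptions made for illustration and do not come from the specification.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (fine for small counts)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % q for q in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Assumed alphabet: 7-bit ASCII, each character mapped to its own prime.
CHAR_PRIME = {chr(c): prime for c, prime in zip(range(128), first_primes(128))}
P = 1009      # assumed prime, larger than every prime in CHAR_PRIME
M_BITS = 32   # assumed hash length m

def substring_hash(substring):
    """Sum prime(ch) * P**k with k running from len(substring) down to 1,
    then reduce modulo 2**M_BITS (mirrors the 7p^3 + 3p^2 + 9p example)."""
    total = 0
    power = len(substring)
    for ch in substring:
        total += CHAR_PRIME[ch] * P ** power
        power -= 1
    return total % (1 << M_BITS)

def split_hash(hash_value, n_bits):
    """Split an m-bit hash into (page index, remainder): the first n bits
    select a hash page, the remaining m-n bits are searched for in that page."""
    rest_bits = M_BITS - n_bits
    return hash_value >> rest_bits, hash_value & ((1 << rest_bits) - 1)
```

For instance, `split_hash(substring_hash("ifvaropvar"), 8)` would yield the page number to consult and the value to look for within that page, under the assumptions stated above.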

At a step 114 the program stores data relating to any 
matches found in the index file after it is queried for a current 
hash value. In the described embodiment, a list of <hash 
value, position> pairs is stored in a temporary file. The 
index file stores positions of each hash value. Thus, at step 
114 if a current hash value is found in the index file, the 
position and value stored in the index file matching the 
current hash value are stored in the temporary file until all the 
substrings in the query document (as described in step 108) 
have been hashed and searched for in the index file. In the 
described embodiment, a position value encodes (i.e. it does 
not explicitly state) the name of the document and an offset 
within that file where the hash value begins. Thus, this 
position value performs as an absolute position or address 
within the collection or corpus of documents insofar as it 
can be used to go directly to a position within a document 
regardless of where that document resides in the corpus of 
documents. If the documents in the collection being compared 
against each other are expected to be dissimilar, step 114 will 
normally result in small amounts of matching data or none 
at all. However, this depends on the nature of the collection 
of documents being compared. 
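
One way to picture a position value that encodes both a document and an offset is sketched below in Python. It assumes each document is assigned a starting position in a single corpus-wide address space; a sorted list searched with `bisect` stands in for the B-tree of document start positions mentioned earlier. The class and method names are hypothetical.

```python
import bisect

class CorpusPositions:
    """Maps (document name, offset) to one absolute position and back."""

    def __init__(self):
        self.starts = []    # start position of each document, ascending
        self.names = []     # document names, parallel to self.starts
        self.next_free = 0  # first unused absolute position

    def add_document(self, name, length):
        """Reserve `length` positions for a new document; return its start."""
        start = self.next_free
        self.starts.append(start)
        self.names.append(name)
        self.next_free += length
        return start

    def encode(self, name, offset):
        """Absolute position of `offset` within document `name`."""
        return self.starts[self.names.index(name)] + offset

    def decode(self, position):
        """Recover (document name, offset) from an absolute position."""
        i = bisect.bisect_right(self.starts, position) - 1
        return self.names[i], position - self.starts[i]
```

With this layout a stored position never names the document explicitly, yet the document and the offset within it can be recovered by a single ordered lookup.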

To save memory, not every substring's hash value is 
saved. In a preferred embodiment only those substrings 
beginning at or near fixed boundaries in a document (string) 
are saved. At a step 116 the program checks whether it has 
passed a particular position or boundary in the string. This 
position, referred to as the gth position (for example every 
5th or 10th position in the string), is an offset or gap between 
the beginning of every new substring and the previous 
substring. At step 116 the program determines whether it has 
passed the gth position since having saved (or stored) the last 
hashed substring. Each time the program passes the gth 
position it will want to save another hash value and generally 
it will not want to save more than every gth substring. If the 
program has passed a gth position in the string, it will 
increment a g counter at a step 118. 

If the program determines that it has not passed a gth 
position at step 116 or if the program increments the g 
counter at step 118, control goes to a step 120 where the 
program checks whether the g counter is greater than zero 
and whether the hash is 0 modulo j for a predetermined value 
j. In the described embodiment, j has a value that is less than 
g. By using 0 mod j to determine which substrings to save 
(described in step 122 below) in the described embodiment, 
the program is able to reduce the number of substring hashes 
that need to be queried at step 112. Only those substrings that 
have a hash value that is evenly divisible by j need to be 
searched for in the index file. Returning to step 116, once a 
gth boundary or position is passed, the program is ready to 
save another hash value. It will do this the next time it 
determines that 0 mod j is true for a hash value of the current 
substring. 
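
The interplay of the g counter and the 0 mod j test (steps 116 through 120, together with the save and decrement performed at step 122) can be sketched as follows. This is an illustrative Python reading, not the specification's implementation: it assumes that "passing a gth position" means the current position is a multiple of g, and the function name `select_hashes` is hypothetical.

```python
def select_hashes(hashes, g, j):
    """Choose which substring hashes to save in the index.

    hashes[pos] is the hash of the substring beginning at position pos.
    Returns the saved (hash value, position) pairs in order.
    """
    saved = []
    g_counter = 0
    for pos, h in enumerate(hashes):
        if pos % g == 0:                  # step 116: a gth boundary was passed
            g_counter += 1                # step 118: one more save is owed
        if g_counter > 0 and h % j == 0:  # step 120: ready to save and 0 mod j
            saved.append((h, pos))        # step 122: save <hash value, position>
            g_counter -= 1                # step 122: decrement the g counter
    return saved
```

Because the counter accumulates while no hash satisfies the 0 mod j condition, a long stretch without a qualifying hash is followed by several saves in close succession rather than a permanent gap.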

At step 120, if the g counter is greater than zero 
(indicating that the program is ready to save another hash 
value) and the hash value is evenly divisible by j, the hash 
value of the substring and its position in the document are 
saved in the index file at a step 122. The g counter is also 
decremented by one at step 122. Normally this will reset the 
counter to zero but it is possible that the counter was greater 
than one if the 0 mod j condition had not been met within 
several substrings of length g. When the hash value and 
position are saved at step 122, the index table may need to 
be updated. The size of the index file is increased if certain 
criteria are met. For example, if currently existing hash 
pages or blocks are appended with overflow pages to the 
point where access time for searching for a substring 
exceeds a predetermined value, the size of the entire index 
file can be doubled. This doubling of size will occur when 
the number of hash pages in the index file is set according 
to 2^n, where n is increased by one whenever the index file 
needs to be expanded. When this occurs, the addresses or 
boundaries of the newly formed hash pages change from 
their values before the index file was extended. The 
addresses of the hash pages do not change when individual 
hash pages are appended with overflow pages since the 
overall structure of the index file does not change. 
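
The doubling behavior can be sketched in Python as shown below. The sketch assumes a 32-bit hash (`M_BITS`) and in-memory lists in place of disk pages and overflow pages; the class name `HashIndex` and its methods are hypothetical, and the decision of when to call `double()` (for example, when access time exceeds a predetermined value) is left to the caller.

```python
M_BITS = 32  # assumed hash length m

class HashIndex:
    """Index with 2**n pages; the first n bits of a hash select the page."""

    def __init__(self):
        self.n = 0
        self.pages = [[]]  # 2**0 = 1 page initially

    def page_of(self, hash_value):
        """Page number given by the first n bits of the m-bit hash."""
        return hash_value >> (M_BITS - self.n)

    def insert(self, hash_value, position):
        self.pages[self.page_of(hash_value)].append((hash_value, position))

    def double(self):
        """Increase n by one and redistribute every record, since the page
        boundaries change when the index file is extended."""
        records = [rec for page in self.pages for rec in page]
        self.n += 1
        self.pages = [[] for _ in range(2 ** self.n)]
        for hash_value, position in records:
            self.insert(hash_value, position)
```

Appending to an existing page models the overflow case, in which no page address changes; only `double()` moves records between pages.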

If it is determined at step 120 that the g counter is zero or 
the hash value of the substring is not evenly divisible by j, 
the program checks at a step 124 whether the last position or 
character in the current document has been reached. The 
program also goes to step 124 after saving a <hash value, 
position> pair and decrementing the counter at step 122. If 
the end of the document has not been reached, control 
returns to step 108 where the next substring of length l is 
selected and the process is repeated. If the last character in 
the document has been read, the program performs, at a step 
126, a clustering operation that integrates or incorporates the 
current document into an existing cluster of documents if the 
program determines that the current document has a 
sufficient number of matches with any of the other previously 
loaded documents. The clustering is preferably done using 
the union/find operation. The union/find algorithm is a 
method known in the field of computer programming. Step 
126 is described in greater detail in FIG. 4. 
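
A minimal union/find sketch in Python shows how a newly loaded document could be merged into existing clusters. The per-document match counts, the `threshold` parameter, and all names here are assumptions for illustration; the specification's own clustering procedure is the one described with FIG. 4.

```python
class DocumentClusters:
    """Union/find (disjoint set) structure over document names."""

    def __init__(self):
        self.parent = {}

    def add(self, doc):
        """Register a document as its own cluster if it is not known yet."""
        self.parent.setdefault(doc, doc)

    def find(self, doc):
        """Return the representative of the cluster containing doc."""
        root = doc
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[doc] != root:  # path compression
            self.parent[doc], doc = root, self.parent[doc]
        return root

    def union(self, a, b):
        """Merge the clusters containing a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def cluster_document(clusters, doc, match_counts, threshold):
    """Join doc to the cluster of every previously loaded document with
    which it shares at least `threshold` matching hash values."""
    clusters.add(doc)
    for other, count in match_counts.items():
        if count >= threshold:
            clusters.add(other)
            clusters.union(other, doc)
```

Two documents then belong to the same cluster exactly when `find` returns the same representative for both.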

Control then goes to step 128 where it is determined if 
there are any other documents in the collection of documents 
received at step 102. If there are more documents, control 
goes to step 104 where the next document is selected, 
followed by preprocessing and the other steps described 
above. If the last document has been examined, the program 
goes to step 130 where the data relating to the matching hash 
values is coalesced into passages and presented to the user. 
This process is described in further detail in FIG. 5. After the 
data has been coalesced at step 130 the comparison of the 
collection of documents is complete. 

FIG. 2 is a block diagram of an index file and of records 
contained in the index file in accordance with one embodi- 
ment of the present invention. The index file, also referred 
to as a hash table, contains a portion of a substring hash 
value followed by position data. In other preferred embodi- 
ments the index file can be implemented using other data 
storing constructs such as a neural network. For example, a 
neural network can be trained to recognize substrings it has 
seen before and a query document can then be run through 
the network in order to match substrings. In the described 
embodiment, a hash value is computed on a substring of 
length l (typically measured in characters) and is made up of 
m bits. A hash value 202 is shown at block 204. In the 
described embodiment, m is 32 to 64 bits. A first portion of 
hash value 202 is an index 206 of length n bits, typically 8 
to 16 bits, that acts as a pointer to a hash table 208. A value 
210, the remaining portion of hash value 202, is stored in a 
hash table record 212 in table 208. Position data 214 is also 
32 to 64 bits long and is stored following value 210. As 
described above, position data 214 contains the name of the 
document or file that is being hashed and stored followed by 
the offset within the document where the substring is 
located. In other preferred embodiments, a non-numerical
based hash function can be used to build the index file. For
example, a semantic-based function where letters in a substring
can be used to distribute substrings in the index file.
More broadly, the index file can be seen as an association list
in which substrings can be indexed to some other value.

In a preferred embodiment, preceding value 210 in record
212 is a single-bit field 216 that indicates whether value 210
represents a substring that appears more than once in the
index file. In the described embodiment, if this bit is zero,
value 210 represents only one substring in the index file,
which is expected under most circumstances. That is, it is
not expected that an exact duplicate of a substring appear
even once in a set of documents. However, should this occur,
field 216 will contain a one and a variation of record 212,
shown as a record 217, will have an additional count field
218 that will contain the number of times a substring appears
in the index file. Count field 218 is followed by multiple
position fields 222 each of which encodes the same data as
position data 214.

Index file 208 is typically comprised of multiple hash
pages, an example of which is shown at 224. In the described
embodiment the number of pages is base two. Thus, there is
initially one page in the index file which can expand to two,
four, eight, 16, and so on, when needed. At the beginning of
each page is a page header 226. All the fields in header 226
are fields normally expected in a header for a page in a hash
table. One field worth noting is a page overflow field that
indicates whether the hash page has overflow pages by
containing a pointer to the first overflow page. Step 122 of
FIG. 1 includes updating the index file and data structure for
determining a position of a substring in a document and for
storing data related to a particular document. The data
structure referred to can be a B-tree type structure that
contains information on which document and offset is
described given a particular <hash value, position> pair. In
other preferred embodiments, a binary tree or simple look-up
table can be used to store this information.

Briefly, in the described embodiment, each leaf node in
the B-tree contains a code indicating the name of a document
and the range of bytes in that document. The B-tree can also
contain the total number of hashes in a particular document.
By following the nodes in the B-tree, the program can
determine which document a particular position value
belongs to or, similarly, the beginning and ending bytes of
each document. In the described embodiment, the position
value encodes the name of the document and the offset
within that document where the hash value begins. This
B-tree structure is used by the program to retrieve data
regarding the boundaries of documents, the total number of
hash values in a particular document, document type (if
needed), and other related information. Thus, a position
value can be inserted into the B-tree and a particular
document and offset can be determined.

FIG. 3 is a diagram showing a transformation of a raw
data string to a series of substrings of length l and gap g in
accordance with one embodiment of the present invention.
In a simple illustration, a raw data string 302 represents a
text file such as a word processing document. Shown above
string 302 are position indicators 304 that show positions 0
through 16 in string 302. As discussed in step 106 of FIG.
1, the raw data string is preprocessed or translated to place
it in a form suitable for segmenting and hashing. A translated
data string 306 shows an example of how raw data string 302
can be translated. Translated string 306 is a string of
characters with capitalization, white spaces, and punctuation
removed. Further preprocessing of raw data string 302 could
include removing words "this" and "is" under the assumption
that they are words that would be used frequently
anyway and would not be useful indicators of copying.
Substring length l and offset or gap g are then used to
segment translated data string 306. Length l can be determined
empirically and can vary widely depending on the
type of documents being stored for future comparison. For
a normal text file l is typically in the range of 30 to 40
characters. Typically when a person sees this number of
consecutive characters in two documents, copying is suspected.
The number will likely be different for other types of
documents such as computer programs or files storing
records in a database. The offset or gap g between hashed
substrings is determined by availability of storage and the
level of success or probability in finding matches among
documents.

For the purposes of illustration, in FIG. 3 length l is three
and the offset g is two. In the described embodiment, g must
be less than l, and in most cases will be significantly smaller
than l. Brackets 308 illustrate how translated string 306 is
segmented. Each segment is three characters long and each
new segment begins two characters after the beginning of
the previous segment. This results in six substrings 310,
which may include duplicate substrings. A hash function is
applied to each of the substrings, as described in step 106 of
FIG. 1, to derive a hash value 202. Position data for each of
the substrings is also stored in the index file. For example,
a position value for substring "for" encodes the name of raw
data string 302 (e.g. "sample text.doc") and its offset within
the string, which in this case is byte 11.

In the example shown in FIG. 3, l+g is five characters
long. If a second data string, i.e. a query document, is
compared against data string 302 and contains a substring of
length five that has the same consecutive characters as any
substring of length five in string 302, a comparison method
based on a preferred embodiment will detect that three of the
five characters in the substrings match. Thus, if the query
document contains "thisi" or "tfolk" for example, this similarity
to raw data string 302 will be detected and presented
to the user. By increasing g, or l, a longer identical substring
must be present in the query document in order for the
comparison program to guarantee the detection of the similarity.
Thus, in another example where index space is more
limited and g is four instead of two (and l is greater than
four), the query document would have to contain a substring
(the sum of l and g) of length seven in order for the
comparison program to detect the similarity. Substrings such
as "thisisi" or "itfolks" would have to be present in the query
document for the similarity to be detected.

As mentioned above with respect to step 120 of FIG. 1, in
the described embodiment, the way a substring is chosen for
storage in the index file depends not only on offset g but also
on the condition 0 mod j criteria thereby introducing the
variable j. Every hash value of the current substring that
satisfies 0 mod j after having passed a g boundary in the
string is stored in the index file. By using the 0 mod j criteria
for saving substrings, where j is relatively small compared
to g, the offset or gap between each saved substring will very
likely be close to g but will not be guaranteed to be g. Based
on probability theory, the gap will typically vary between a
few positions before and a few positions after each gth
position in the string. If g is set to two and j is one, the
segmenting would not be different from the segmenting
shown in FIG. 3; that is, substrings would be chosen strictly
by g (whenever j is set to one). In another preferred
embodiment, every gth substring of length l is hashed and
stored in the index file. By using this method, the program
can guarantee that if there is the same passage of length l+g
in two or more documents, the program will detect a same
passage of length l.
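
The indexing scheme of FIGS. 2 and 3 can be illustrated with a
minimal Python sketch. Everything named here is an assumption made
for illustration: the sample string "This is it, folks", the use of
CRC-32 as the hash function, the nested dictionary standing in for
hash pages, and the choice to record offsets within the translated
string rather than the raw string. It shows translation, segmenting
with length l and gap g, and the split of an m-bit hash into an
n-bit index and a remaining value.

```python
import string
import zlib


def translate(raw: str) -> str:
    """Lower-case the text and drop white space and punctuation."""
    drop = set(string.whitespace + string.punctuation)
    return "".join(ch for ch in raw.lower() if ch not in drop)


def segment(text: str, l: int, g: int) -> list[tuple[int, str]]:
    """Return (offset, substring) for every g-th substring of length l."""
    if not 0 < g < l:
        raise ValueError("the described embodiment requires 0 < g < l")
    return [(i, text[i:i + l]) for i in range(0, len(text) - l + 1, g)]


def split_hash(h: int, m: int = 32, n: int = 8) -> tuple[int, int]:
    """Split an m-bit hash into an n-bit index and the remaining value."""
    return h >> (m - n), h & ((1 << (m - n)) - 1)


def build_index(doc_name: str, raw: str, l: int = 3, g: int = 2) -> dict:
    """Map index -> value -> list of (document name, offset) positions."""
    index: dict = {}
    for offset, sub in segment(translate(raw), l, g):
        idx, value = split_hash(zlib.crc32(sub.encode()))  # CRC-32 is assumed
        index.setdefault(idx, {}).setdefault(value, []).append((doc_name, offset))
    return index
```

With l of three and g of two, the assumed sample string yields six
substrings, one of which ("isi") occurs twice, so its record would
carry two positions in the way count field 218 and position fields
222 describe.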

FIG. 4a is a flowchart showing in greater detail step 126
of FIG. 1b in which a current document is clustered based
on matches with documents previously loaded into the index
file. The input for a step 402 is a list of matches that was
created at step 114 of FIG. 1a. FIG. 4b is an illustration of
a format of a match list in accordance with one embodiment
of the present invention. In the described embodiment, this
list contains at least three items of information: a hash value
210, its position 214 in the current document, and a list of
positions O_i in other (previously indexed) documents that
have the same hash value 210. However, it is possible that
a hash value may appear two or more times in the same
document and may have been stored in the index file. In this
case, the matching O_i position represents a position in the
same document as opposed to the more typical situation of
representing another document. The hash value and position
pair is shown in FIG. 4b as tuple 416. Associated with tuple
416 is a list 418 containing at least one position value O_1,
shown as item 420, indicating a position in another document
that contains the same hash value 210. The current
document can have other hash values that were also matched
with hash values in other documents represented by tuples
422 and their corresponding position lists.

At step 402 each list is expanded into pairs or tuples in
which hash values have been eliminated and that contain
only position values. FIG. 4b also shows an expanded
position list 424 created at step 402. This list is created by
pairing each position in the current document with each
matching position O_i in other documents. List 424 includes
a series of tuples where each tuple 426 has a position value
214 from the current document and a position value 420
from another document. However, as mentioned earlier, it is
possible that a hash value may appear two or more times in
the same document and may have been stored in the index
file. In this case, the matching O_i position represents a
position in the same document as opposed to the more
typical situation of representing another document. Thus, in
each list 424, position value 214 of the current document
will be the same but the position values O_i from the other
documents will be different. This is done for all position
values in the current document that have matches in other
documents. Typically, in applications where the documents
are not expected to have many similar passages, these lists
are not very long and can be stored in main memory for
quick access.
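
The expansion performed at step 402 can be sketched as follows. The
tuple layout of the match list and the function name are assumptions
made for illustration; the description only requires that each entry
carry a hash value, a position in the current document, and a list of
matching positions in other documents.

```python
def expand_matches(match_list):
    """Expand (hash_value, p, [o_1, o_2, ...]) entries into (p, o) pairs.

    The hash value is dropped: after this step only position values remain.
    """
    pairs = []
    for _hash_value, p, others in match_list:
        for o in others:
            pairs.append((p, o))
    return pairs
```

For example, an entry with position 5 and matches at positions 120
and 340 expands to the pairs (5, 120) and (5, 340).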

At a step 404 the expanded list of pairs 424 created at step
402 is sorted based on the position values 420 indicating
matching positions in the other documents. This creates a
single list of tuples sorted such that position values O_i from
a single other document (i.e. a document that has already
been indexed) are grouped together sequentially in the list.
FIG. 4b contains an illustration of a list sorted according to
position values in other documents. As shown in a list 428,
position values 420 are increasing. As a result, position
values from the current document become unordered or
random. At a step 406, list 428 is segmented where each
segment 430, for example, represents a single document. In
the described embodiment, the segmenting is done using the
B-tree described above. Using the B-tree, which contains the
beginning and ending positions of documents stored in the
index file, the program can determine where the boundaries
of the documents are in the sorted list.
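
Steps 404 and 406 can be sketched as below. This is not the B-tree
of the described embodiment: a sorted list of document start
positions searched with `bisect` stands in for it, and the sketch
assumes that positions are integers in one shared address space in
which each indexed document occupies a contiguous range. The function
and parameter names are hypothetical.

```python
from bisect import bisect_right
from itertools import groupby


def segment_by_document(pairs, doc_starts, doc_names):
    """Sort (p, o) pairs by o and group them by the document that owns o.

    doc_starts: sorted start positions of the indexed documents.
    doc_names:  document names, in the same order as doc_starts.
    """
    ordered = sorted(pairs, key=lambda pair: pair[1])

    def owner(pair):
        # The owning document is the last one starting at or before o.
        return doc_names[bisect_right(doc_starts, pair[1]) - 1]

    # Sorting by o keeps each document's pairs contiguous, so groupby works.
    return {name: list(group) for name, group in groupby(ordered, key=owner)}
```

Within each segment the O values increase while the P values from
the current document appear in no particular order, as in list 428.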

At a step 408 the program retrieves a segment, representing
a single document, from the sorted list. At a step 410, a
ratio C/T is computed for the retrieved document. The
similarity of two documents is defined as ratio C/T, where C
is the number of hashes the two documents have in common 
and T is the total number of hashes taken of one of the 
documents, which can be the current document or the 
smaller document. In the described embodiment, the number 
of hashes the two documents have in common is equal to the 
number of position pairs in the segment representing the 
retrieved document. The total number of hashes T can be 
drawn from the B-tree which also stores the total number of 
hashes in each document. By using this ratio, the percentage 
similarity between the current document and the document 
chosen at step 408 from the sorted segment list can be 
calculated. 

At a step 412 a threshold is used to discard the retrieved 
document if the document does not contain a certain match 
ratio. In the described embodiment, if C/T is less than the 
threshold (e.g. a predetermined system parameter), the 
matches associated with the retrieved document are 
discarded, thereby effectively eliminating the document 
from further analysis. By performing this filtering operation, 
only documents having an interesting or significant number 
of matches with the current document are retained. The 
value of the threshold is based on a policy decision as to 
what level of similarity is significant given external factors, 
such as the type of documents being compared. Thus, at step 
412 the program determines if the retrieved document has a 
sufficient number of matches. If not, control returns to step 
408 where the next document segment in the sorted list is 
retrieved. If the number of matches in the retrieved docu- 
ment is significant, control goes to a step 414. 
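
Steps 410 and 412 amount to a ratio and a cut-off, sketched below
under stated assumptions: T is taken to be the total number of hashes
of the current document (the description also allows the smaller
document), the segments are held in a dictionary keyed by document
name, and the names `similarity` and `filter_segments` are
hypothetical.

```python
def similarity(common: int, total: int) -> float:
    """Ratio C/T; returns 0.0 when no hashes were taken."""
    return common / total if total else 0.0


def filter_segments(segments, total_hashes, threshold):
    """Keep only the documents whose C/T reaches the threshold.

    segments:     {document name: [(p, o), ...]} from the sorted list
    total_hashes: T, the number of hashes taken of the current document
    """
    kept = {}
    for doc, pairs in segments.items():
        # C is the number of position pairs in the document's segment.
        if similarity(len(pairs), total_hashes) >= threshold:
            kept[doc] = pairs
    return kept
```

With T of 10 and a threshold of 0.25, a document sharing three
hashes is retained and a document sharing one is discarded.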

At step 414, the program clusters the retrieved document 
with existing clusters of documents. The purpose for clus- 
tering is to determine whether there are other groups of 
documents of which the current document can be part based 
on similarities. In the described embodiment, the clustering 
is used to present in a meaningful way to the user passages 
of similar text from groups of documents where each group 
is expected to have at least some similar passages. If the 
current document is not grouped with an existing cluster, it 
creates its own single-document cluster, which can subse- 
quently be clustered with incoming documents and existing 
clusters. In another preferred embodiment the clustering can 
be done after all the documents in the collection have been 
indexed, which can be referred to as batch clustering as 
opposed to incremental clustering described above. 

FIG. 4c is an illustration of a data structure showing how 
documents can be clustered in accordance with one embodi- 
ment of the present invention. Shown are three clusters 432, 
434, and 436. A current document 438 is brought in. The 
clustering operation may be performed using a standard 
union/find algorithm where the program first determines to 
which set or existing cluster the document belongs. The 
program then takes the union of the current document and 
the set of retrieved documents (i.e. those documents 
retrieved at step 408). This can be done by taking a repre- 
sentative element or document from an existing set or cluster 
and comparing it to the current document. If the element in 
the current document is found in the cluster, the document 
can be unioned with the cluster. The two previously existing 
sets (the current document being one set) are eliminated and 
a new cluster is formed. This is a well-known procedure and 
can be done in nearly linear time. The union either results in 
the current document being joined or clustered with a set of 
retrieved documents or, if there is no union, a new single- 
document cluster made up of the current document. It is also 
possible that the current document belongs to two or more 
existing clusters in which case the clusters are joined to form 
yet a larger cluster of documents. 
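
The incremental clustering of step 414 can be sketched with a small
union/find structure. This is a generic textbook version with path
compression, not necessarily the variant used in the described
embodiment; the class and method names are hypothetical.

```python
class Clusters:
    """Disjoint sets of document names (union/find with path compression)."""

    def __init__(self):
        self.parent = {}

    def find(self, doc):
        """Return the representative of doc's cluster, creating it if new."""
        self.parent.setdefault(doc, doc)
        root = doc
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[doc] != root:  # path compression
            self.parent[doc], doc = root, self.parent[doc]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def add_document(self, current, retrieved):
        """Join the current document with every retained matching document."""
        self.find(current)  # a document with no matches forms its own cluster
        for doc in retrieved:
            self.union(current, doc)

    def groups(self):
        out = {}
        for doc in list(self.parent):
            out.setdefault(self.find(doc), set()).add(doc)
        return sorted(sorted(g) for g in out.values())
```

A document that matches members of two existing clusters joins them
into one larger cluster, which corresponds to the last case described
above.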
FIG. 5 is a flowchart showing in greater detail step 130 of
FIG. 1b of coalescing matching substrings into passages and
presenting output to users in accordance with one embodiment
of the present invention. For the purpose of illustrating
a preferred embodiment of the coalescing operation of the
present invention, a cluster containing two documents is
described. The methods and techniques described below for
a cluster of two documents can be extended to coalesce
documents in a cluster containing multiple documents, and
are not intended to be limited to clusters of a pair of
documents.

The coalescing procedure operates on a cluster of documents
that was formed at step 414 of FIG. 4a and shown in
FIG. 4c. Thus, documents that are potentially coalesced are
those documents from a single cluster. At a step 502, one is
selected (the "current cluster") from the group of clusters. In
the described embodiment, the data structure representing
the clusters can be kept in main memory instead of on disk
given the typically small amounts of memory needed to
store cluster data, although the size can vary according to the
application and type of documents being compared. The
coalescing operation is performed on a cluster because a
cluster is a much smaller set of documents compared to the
potentially huge collection of documents, and its documents
are far more likely to have significant similarity. In another
preferred embodiment, the coalescing operation can be performed
without the clustering procedure thereby using the original
full set of documents. This may be preferred if the original
set of documents is small. At a step 504 the program flags
all substrings that appear more than once in a document in
order to process duplicate passages (appearing two or more
times) in a document more efficiently. This is done by
examining the hash values encoded in the O's. At a step 506
the program finds all sequences of unique position pairs
among all the documents in the current cluster and coalesces
those pairs into longer segments. This operation begins by
examining the sorted list created in steps 404 and 406 of
FIG. 4a and illustrated in FIG. 4b, where the list of position
pairs is sorted according to previously indexed documents
(O_i values). At step 406 the sorted list is segmented into
documents that have already been loaded into the index file
(i.e. hash table).

Step 506 is performed by first checking each position
(e.g., value 420 in FIG. 4b) in the sorted list corresponding
to the documents in the current cluster. For each position
pair in the sorted list, the program checks whether the O_i
values 418 are in sequence by referring to the B-tree. In
order to be in sequence, a value O_{i+1} should not precede O_i.
Thus, the program scans the sorted list and determines
whether the next O_i position in the list is adjacent to the
current O position. Since the length l is fixed, adjacency can
be determined to be true if O_i ≥ O_{i+1} − l. This calculation
indicates whether the two current O_i positions are adjacent
(or overlapping), or whether there is a gap or disjoint
between them. Data in the B-tree can be used to determine
the values for the O positions. If the difference between those
values is equal to or less than l, they are considered to be in
sequence. Similarly, each P position (e.g., value 214 in
FIGS. 2 and 4b) in the position pair is examined to see if it
is in sequence with the P position in the next position pair,
and whether the difference in length is the same as the
difference in length between the O positions. In the described
embodiment, this can be done by checking whether
O_i − O_{i+1} = P_i − P_{i+1}. If these conditions are met, the
program coalesces position pairs to form a single position pair
with an associated length where the length is greater than l
depending on how many position pairs were found to be in sequence.
Thus, the resulting list of position pairs will likely have
fewer position pairs than the original sorted list and some of
the pairs will have an associated length value greater than l.
This check can be extended to cover situations where the
program detects similarities among three or more documents
(in addition to or to the exclusion of detecting similarities
between two documents). This can be done by checking
whether O_i − O_{i+1} = P_i − P_{i+1} = N_i − N_{i+1}, where N
represents a third document in the cluster.
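
The two-document case of step 506 can be sketched as follows. The
sketch assumes integer positions, a list of (P, O) pairs already
sorted by O within one document segment, and a fixed substring length
l; the function name and the (P, O, length) output layout are
hypothetical. Two consecutive pairs are merged when their O positions
are at most l apart and their P positions advance by the same amount.

```python
def coalesce(pairs, l):
    """Merge in-sequence (p, o) pairs into (p_start, o_start, length) passages."""
    passages = []  # each entry: [p_start, o_start, length]
    last_p = last_o = None
    for p, o in pairs:
        step = None if last_o is None else o - last_o
        # In sequence: O positions adjacent or overlapping, and P moves in step.
        if step is not None and 0 < step <= l and p - last_p == step:
            passages[-1][2] = o + l - passages[-1][1]
        else:
            passages.append([p, o, l])
        last_p, last_o = p, o
    return [tuple(entry) for entry in passages]
```

A pair whose O position follows closely but whose P position jumps
elsewhere starts a new passage, since the offsets no longer agree.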

At a step 508 pairs of passages that overlap are identified
and split up for all documents in the current cluster. The
purpose of step 508 is to eliminate overlapping pairs that
have the same offsets (i.e. overlap the same amount)
between two documents by segmenting the overlapping
pairs into three non-overlapping passages. This step simplifies
processing since at this stage all disjoints in the string
are eliminated. This is conveyed by the conditions described
above with respect to step 506 (i.e. by checking if
O_i ≥ O_{i+1} − l, and whether O_i − O_{i+1} = P_i − P_{i+1}).
Thus, every instance where
the program detects the same overlapping pairs, the two
overlapping passages are replaced with three segments: a
first segment that consists of only the first passage, a second
segment that corresponds only to the overlapping section,
and a third segment that consists only of the remaining
portion of the second passage. A new name is assigned to the
middle overlapping portion and the hash values for the two
segments are reassigned to the (now shorter) non-overlapping
sections.
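
The three-way split of step 508 can be sketched for a single pair of
passages within one document. Passages are represented here as
(start, length) tuples, which is an assumption made for illustration,
as is the function name; renaming the middle portion and updating the
index file are left out.

```python
def split_overlap(first, second):
    """Split two partially overlapping passages into three disjoint ones.

    first, second: (start, length) tuples with first starting no later
    than second. Passages that do not partially overlap are returned as is.
    """
    (s1, n1), (s2, n2) = first, second
    e1, e2 = s1 + n1, s2 + n2
    if not (s1 < s2 < e1 < e2):
        return [first, second]
    return [
        (s1, s2 - s1),  # part of the first passage before the overlap
        (s2, e1 - s2),  # the overlapping section
        (e1, e2 - e1),  # remainder of the second passage
    ]
```

For passages (0, 10) and (6, 10) the result is (0, 6), (6, 4), and
(10, 6), which together cover positions 0 through 15 without overlap.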

This is done by first scanning the sorted list (sorted by O_i)
and making note of all places where there are overlapping
O's by examining their positions in the B-tree. In another
preferred embodiment, the difference between O_{i+1} and O_i
can be determined and compared to l. If the difference is less
than or equal to l, the segments overlap. This information is
stored in a temporary data structure. The information is used
to replace all instances of the overlapping passages with the
three new passages. The program searches the index file for
the hash value of the first passage. Once it is found, record
212 of FIG. 2 will indicate all the positions that the hash
value occurs in the corpus of documents. Those positions
that fall within any of the documents in the current cluster
are replaced with the new hash values.

A similar procedure is applied to the P positions in the
sorted list. First, the list is sorted based on P_i instead of O_i.
The program then checks for overlaps in P by using position
data in the B-tree. Similarly, in other preferred
embodiments, overlaps in P can be determined by comparing
the difference between P_i and P_{i+1} to l since the position
pairs have been segmented into documents and the program
is checking for overlaps within a single document. For those
overlaps that have the same offset as overlaps in the O
positions, the information stored in the temporary data
structure is used to replace the overlapping P passages. Since
other position pairs may contain the P value being changed,
the P values in those pairs are changed as well to keep the
value consistent. In the described embodiment, the temporary
data structure maps hash values of segments to positions
of those segments in a document.

At a step 510 filler or dummy passages are inserted to fill 
any gaps between passages. This is done for each document 
in the current cluster to facilitate subsequent string opera- 
tions. This gap should not be interpreted by the program to 
mean that the first pair and the pair following the gap are 
adjacent. The purpose is to create a continuous string of 
non-overlapping segments or passages. Step 510 further 
simplifies the string transforming it to an abstract of the 
original document. In the described embodiment, each filler 
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passage is sized to exactly fit each gap in the sequence of embodiment, each lower case letter in the two strings can be 

passages making up a document. represented by a color. The text represented by those pas- 

At a step 512 the program finds the maximum length sages are presented to the user in a particular color and the 

passage that matches between the two documents in the user can compare passages that have the color in two or 

current cluster and then removes that passage. In the s more documents to see actual text and the location of the text 

described embodiment, the steps preceding step 512 in FIG. that appears in those documents. For example, with the two 

5 transform the documents in a cluster into efficient and strings above, the passage FLKcan be shown in red in both 

manipulable strings of segments that can now be processed documents and the passage AB can be shown in blue. The 

to detect similar passages. Because the documents have been use r can then determine quickly what passages are the same 

simplified to the form of these strings, the problem of 10 in the two documents. In other preferred embodiments, other 

identifying similar passages has been substantially reduced. indicators can be used to efficiently present similar passages 

In the described embodiment, the program focuses attention in the documents to the user. For example, similarities 

on a few documents (i.e., a cluster) that actually do have among documents can be shown using graphical summaries, 

some common material out of a potentially large collection such as colored histograms or multiple color bars, which 

of documents. 35 correspond to the colored text described above. In addition, 

One method of performing step 512 is a brute force passages from multiple documents can be aggregated in 

algorithm that keeps a marker and counter for the two different ways depending on the user interface and user 

documents or strings. For example, for the two strings: ne . eds> For example, information on a subset of documents 

string 1* HFLKXAB within a cluster can be presented to a user in cases where 

strin 2* ABZFLKW 20 ^ n ^ orma ^ on on me ^ ^ °^ documents would be skewed. 

c . i 1 « ti- • * • -■ j l i can occur because of one unusual document in the 

the program first places a marker at H in string 1 and checks , , , t , . 4 . „ . .. . 4 . 

•p*u • it * a ■ *\ t it. • *t_ • tt • . • cluster having properties that throw off similarities present 

it there is an H in string 2. In this case there is no H in string • 4 . , u j L * u *u ■ -i i_ 

n . . i - *i_ * y-i . m the other documents, where those similarities may be 

2 so the counter is zero. The marker is then moved to F in „ - f t . _.. 0 4l _ J 

.„ . A . , , more informative to the user. At a step 518 the program 

^ /^ ram , W u * ™ ° Q S 25 retrieves the next cluster of documents as derived in step 414 

when it hits the firet F in string 2. The marker will then read 0 fFIG. 4a and repeats the process from step 502. If there are 

the L in string l, match it with the L in stnng 2, and increase nn mrtr<1 , ■ , a r 

*t_ . * / at. j- xr • l A. . • no more clusters the process is done. 

the counter to two. After reading the K in both strings, the f „ niUa afo t Q , i- , f , u 

.„ , . j » *l /.u • • In otner preterred embodiments of the present invention, 

counter will be increased to three (the counter is mere- j rt „„ mM * f a ♦ u * • i 1 1 f , u 

.it r . ; . 4t . documents found to have a certain level of matches or a 

mented when the sequence of characters is the same). The _ „ Qr ,;™,u, „^,™*o„~ ~f ^u- ta t t * u 

V, «. » /• i t . * . .% 30 particular percentage oi matching text or content, can be 

program continues until the end of the strings and notes that A - „ . AA a • c i • • . r . . t- 

f. i , , . ™ r ?Z, , , . . digitally shredded. This is useful in a variety of contexts. For 

the longest substring was three. The FLK substring is T c * j * / L 

j . , .z, , . , , example, copies ol a master document (often a proprietary 

assigned a new identifier, such as a unique number or letter, A L \\ u a- * *u ♦ a ■ ^ 

• ■ .t , ' . . , . . . , ' document) can be distributed within an organization. Over 

and is then removed or nagged to indicate that it has already 4 . 4 . . .« ln , , j-Jr j j 

. . . *u * a T hme, these copies will likely be modified, partitioned, 

been examined, so the program can perform a step 514. In , A A t - i a * n i c . j * 

c -, . . i j-. ^. 35 storea, duplicated, etc. Examples of a master document are 

another preferred embodiment, the edit difference between „ « • , 

.... . * j j r it- a lL i numerous: a business plan, a computer program, a 

the strings can be computed and from that the maximal • 4 t , . T4 , _, . , , * 

, , . & i i ■ j t*l* , i j , j manuscript, a project proposal, etc. It may be desirable at 

matching passages can be denved. This method can be used • *r « u j « * r.u 

i > *u I . c i -. L j •« » , . some point to digitally shred all remnants or copies of the 

in place of the brute force algorithm described above or in A t , , , c i a 

r . t . . . <( & J>t „ . . . document and keep only one copy, for example, after a 

conjunction with it. An "edit distance" between two „ n u M „ ^^i^Ia nt . n « u^.. k«« 

j . . . . iL ! A , - . . , „ 40 review has been completed or a project has been terminated, 

documents, or strings, is the length of a minimal sequence of r™. 4 * u r * u u j * 

' . j u .i- r / That is, such a feature would be advantageous in a comput- 

msertions, deletions, and substitutions that transforms one to . . . . . , 

the other ^la^.^^iu mg environment where it would be undesirable to keep 

At ' g+A.i. • j i- • , complete, modified, or partial copies of a document in an 

At step 514 the same process is repeated for successively •„ * Z ■ ♦* T i • 

r . , r , , t „ /- „ orgamzation after a certain time. In a large organization, 

non-increasing length matches until the length of the ^ • c * L a . j- ■ * • i i j l 

, . , . . ■ i . . ™ 7 45 copies of the document can disseminate quickly and be 

matches decrements to single characters. Thus, the program . «. . , r . A - ; 

. , iL , , , ,i iT * • • stored in vanous places. In one embodiment of the present 

would then detect the AB passage in the two strings and n *u * /• \ a * ■ 

. . , f T * , , ^""fr invention, after the master (i.e., query) document is com- 

assign a unique identifier to it. In the described embodiment, A u i t . c > A . , , ■ , 

ii u , . < i i_ j . i_ . rr • . pared with the corpus of copied, modified, and other denved 

all characters that had no matches, such as H or W in string
2, keep as their identifier their original hash values. Thus,
assuming the following identifiers for the passages in the
two strings:

H: h      Z: z
FLK: m    W: w
X: x
I: i

the strings can be represented as: "hmxi" and "izmw". Thus,
the strings on which matching is now performed have letters
as identifiers that represent many more characters in the
original document. Each of the identifiers in these strings
has associated position and length information.
After step 514, the program can use more expensive
techniques on these simplified strings to present similar
passages to the user at a step 516. In a preferred

documents stemming from the master document, the user
has the option of shredding or purging from the computing
environment any or all of these documents from the
corpus. Thus, the user is presented with a list of all matching
documents and can digitally shred any of those documents
before exiting the comparison program.

In another embodiment of the present invention, before
exiting the comparison program and after digitally shredding
selected documents, a user can perform a "scribbling"
function. Although shredding or deleting selected documents
and files generally purges those files from a computing
environment, techniques are available that allow the
recovery of those files from the memory storage areas from
which they were erased. A scribbling function allows a user
to write over those areas numerous times to eliminate the
possibility of recovering the digitally shredded files. In one
embodiment, the program writes only ones over the
memory spaces that stored the shredded files, followed by
only zeros, followed by ones, etc. This is done as many
times as necessary to render any recovery techniques useless.
A user is presented with the option to scribble over the
areas previously containing the shredded files when presented
with the list of matching files.
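
The alternating overwrite described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `scribble`, the default of four passes, the chunk size, and the `remove` flag are all assumptions made for the example. It also operates at the file level, and on journaling or copy-on-write file systems and on solid-state drives an in-place overwrite may never reach the physical blocks that originally held the data.

```python
import os

def scribble(path, passes=4, chunk_size=64 * 1024, remove=True):
    """Overwrite a file in place with alternating all-ones and all-zeros
    bytes, then (optionally) delete it. Returns the number of passes written.

    Hypothetical helper: the name, pass count and chunk size are assumptions.
    """
    size = os.path.getsize(path)
    patterns = (b"\xff", b"\x00")  # all ones first, then all zeros
    with open(path, "r+b") as f:
        for i in range(passes):
            pattern = patterns[i % 2]
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(pattern * n)
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push this pass to the device
    if remove:
        os.remove(path)
    return passes
```

A caller would typically invoke `scribble(path)` for each file the user flagged in the list of matching files, in place of a plain delete.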

FIG. 6 is a flow diagram describing a process of compar- 
ing a query document or string against a corpus of docu- 
ments and deleting those corpus documents matching the 
query document in accordance with one embodiment of the 
present invention. The same process can be used in various 
document comparison configurations, one example of which 
is finding all matching passages in a corpus of documents 
(i.e., where there is no specific query document), as used in 
previous examples. In this configuration, for example, the 
user can delete all documents containing a particular passage 
except one. In the situation where there are one or more
specific query documents, the user can choose to delete all
documents that match more than a certain percentage with
the query document(s) and keep only the query document(s).
Other scenarios can arise. However, a process of identifying 
the documents that match and also providing, in the same 
program, the option of deleting those documents, as 
described below, can be used in various other document 
comparison scenarios. 

At a step 602 a query document is compared with each 
document in a corpus of documents. In a preferred 
embodiment, the matching process can be based on the one 
described above utilizing either the coalescing or clustering 
features, or both. In another preferred embodiment, it can
also be a matching process that utilizes the hash scheme as
described in steps 106 through 124 of FIG. 1 and the hash index
as shown in FIG. 2, or other similar hashing schemes. In
another preferred embodiment, the matching process can 
include the translation process described in FIG. 3 or some 
variation thereof, or other processes geared to making the 
comparison more efficient. In yet another preferred 
embodiment, the comparison routine can be a more conven- 
tional brute force algorithm that does not utilize any hashing 
or other described techniques, but rather compares each 
character in the documents. The user can choose to run a
comparison program that provides as output a percentage of 
overlap or commonality between two documents or strings, 
without providing information on what the overlap text is or 
where it occurs. The deletion or "digital shredding" proce- 
dure described here is not dependent on the type of com- 
parison or matching process used. In sum, the comparison 
program can be any one chosen by the user to be suitable for 
the application at hand. 
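
Because the shredding procedure needs only a similarity score per document, any comparison routine that returns one will do. The sketch below is one such routine, reporting a percentage of overlap without saying where the overlap occurs. It is an assumption made for illustration and not the hash scheme of FIG. 1: the function names, the substring length `k=5`, and the use of Python's built-in set (which hashes each substring internally) are all choices made for this example.

```python
def kgrams(text, k):
    """Return the set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def percent_overlap(query, document, k=5):
    """Percentage of the query's length-k substrings that also occur in
    the document. Reports only how much overlaps, not what or where."""
    q = kgrams(query, k)
    if not q:
        return 0.0  # query shorter than k: nothing to compare
    d = kgrams(document, k)
    return 100.0 * len(q & d) / len(q)
```

A score of this kind can be compared directly against a user-chosen threshold when compiling the list of matching documents.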

At a step 604 a list of documents matching the one or 
more query documents is compiled. By compiling such a list 
and presenting it to the user through a digital shredder user 
interface, the user can immediately begin purging docu- 
ments from the network and not have to depend on external 
operating system commands or remembering names of 
documents to be deleted. At a step 606 documents flagged by 
the user for digital shredding are deleted from the user's 
computer or appropriate server or client if in a network 
environment. The deletion operation can be performed by 
normal operating system commands which are executed by 
the comparison program. The deletion operation can also be 
performed by instructing a non-operating system-type pro- 
gram that can communicate with or accept instructions from 
the comparison program of the present invention. Such 
non-operating system-type programs include, for example,
an application program, a browser program, a utility 
program, or other program capable of deleting files. At this 
stage the process of digitally shredding matching documents 
is complete. 
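
Steps 602 through 606 can be sketched as a small program. This is a hedged illustration only: the function names, the 80 percent default threshold, the assumption that the corpus is a single directory of text files, and the use of the standard library's `difflib.SequenceMatcher` as a stand-in comparison routine are not taken from the specification.

```python
import os
from difflib import SequenceMatcher

def similarity(a, b):
    """Percentage similarity of two strings (stand-in comparison routine)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def compile_matches(query_text, corpus_dir, threshold=80.0):
    """Steps 602-604: compare the query with every file in corpus_dir and
    list the (path, score) pairs whose score exceeds the threshold."""
    matches = []
    for name in sorted(os.listdir(corpus_dir)):
        path = os.path.join(corpus_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as f:
            score = similarity(query_text, f.read())
        if score > threshold:
            matches.append((path, score))
    return matches

def shred_flagged(matches, flagged):
    """Step 606: delete only those listed files that the user flagged,
    using an ordinary operating system delete."""
    listed = {path for path, _ in matches}
    deleted = []
    for path in flagged:
        if path in listed:
            os.remove(path)
            deleted.append(path)
    return deleted
```

Restricting deletion to paths that appear in the compiled list keeps the shredding step tied to the comparison result, so a file that never matched cannot be removed by mistake.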

As discussed above, the present invention employs various
computer-implemented operations involving data stored
in computer systems. These operations include, but are not
limited to, those requiring physical manipulation of physical
quantities. Usually, though not necessarily, these quantities
take the form of electrical or magnetic signals capable of
being stored, transferred, combined, compared, and otherwise
manipulated. The operations described herein that form
part of the invention are useful machine operations. The
manipulations performed are often referred to in terms such
as producing, identifying, running, determining, comparing,
executing, downloading, or detecting. It is sometimes
convenient, principally for reasons of common usage, to
refer to these electrical or magnetic signals as bits, values,
elements, variables, characters, data, or the like. It should
be remembered, however, that all of these and similar terms are
to be associated with the appropriate physical quantities and
are merely convenient labels applied to these quantities.

The present invention also relates to a device, system or
apparatus for performing the aforementioned operations.
The system may be specially constructed for the required
purposes, or it may be a general purpose computer selectively
activated or configured by a computer program stored
in the computer. The processes presented above are not
inherently related to any particular computer or other computing
apparatus. In particular, various general purpose
computers may be used with programs written in accordance
with the teachings herein, or, alternatively, it may be more
convenient to construct a more specialized computer system
to perform the required operations.

FIG. 7 is a block diagram of a general purpose computer
system 700 suitable for carrying out the processing in
accordance with one embodiment of the present invention.
FIG. 7 illustrates one embodiment of a general purpose
computer system. Other computer system architectures and
configurations can be used for carrying out the processing of
the present invention. Computer system 700, made up of
various subsystems described below, includes at least one
microprocessor subsystem (also referred to as a central
processing unit, or CPU) 702. That is, CPU 702 can be
implemented by a single-chip processor or by multiple
processors. CPU 702 is a general purpose digital processor
which controls the operation of the computer system 700.
Using instructions retrieved from memory, the CPU 702
controls the reception and manipulation of input data, and
the output and display of data on output devices.

CPU 702 is coupled bi-directionally with a first primary
storage 704, typically a random access memory (RAM), and
uni-directionally with a second primary storage area 706,
typically a read-only memory (ROM), via a memory bus
708. As is well known in the art, primary storage 704 can be
used as a general storage area and as scratch-pad memory,
and can also be used to store input data and processed data.
It can also store programming instructions and data, in the
form of message stores or shared allocated memory holding
thread-specific data cells, in addition to other data and
instructions for processes operating on CPU 702, and is
typically used for fast transfer of data and instructions in a
bi-directional manner over the memory bus 708. Also as
well known in the art, primary storage 706 typically includes
basic operating instructions, program code, data and objects
used by the CPU 702 to perform its functions. Primary
storage devices 704 and 706 may include any suitable
computer-readable storage media, described below, depending
on whether, for example, data access needs to be
bi-directional or uni-directional. CPU 702 can also directly
and very rapidly retrieve and store frequently needed data in
a cache memory 710.

A removable mass storage device 712 provides additional
data storage capacity for the computer system 700, and is
coupled either bi-directionally or uni-directionally to CPU
702 via a peripheral bus 714. For example, a specific
removable mass storage device commonly known as a
CD-ROM typically passes data uni-directionally to the CPU
702, whereas a floppy disk can pass data bi-directionally to
the CPU 702. Storage 712 may also include computer-readable
media such as magnetic tape, flash memory, signals
embodied on a carrier wave, PC-CARDS, portable mass
storage devices, holographic storage devices, and other
storage devices. A fixed mass storage 716 also provides
additional data storage capacity and is coupled
bi-directionally to CPU 702 via peripheral bus 714. The
most common example of mass storage 716 is a hard disk
drive. Generally, access to these media is slower than access
to primary storages 704 and 706. Mass storage 712 and 716
generally store additional programming instructions, data,
and the like that typically are not in active use by the CPU
702. It will be appreciated that the information retained
within mass storage 712 and 716 may be incorporated, if
needed, in standard fashion as part of primary storage 704
(e.g. RAM) as virtual memory.

In addition to providing CPU 702 access to storage
subsystems, the peripheral bus 714 is used to provide access
to other subsystems and devices as well. In the described
embodiment, these include a display monitor 718 and
adapter 720, a printer device 722, a network interface 724,
an auxiliary input/output device interface 726, a sound card
728 and speakers 730, and other subsystems as needed.

The network interface 724 allows CPU 702 to be coupled
to another computer, computer network, or telecommunications
network using a network connection as shown.
Through the network interface 724, it is contemplated that
the CPU 702 might receive information, e.g., data objects or
program instructions, from another network, or might output
information to another network in the course of performing
the above-described method steps. Information, often represented
as a sequence of instructions to be executed on a
CPU, may be received from and outputted to another
network, for example, in the form of a computer data signal
embodied in a carrier wave. An interface card or similar
device and appropriate software implemented by CPU 702
can be used to connect the computer system 700 to an
external network and transfer data according to standard
protocols. That is, method embodiments of the present
invention may execute solely upon CPU 702, or may be
performed across a network such as the Internet, intranet
networks, or local area networks, in conjunction with a
remote CPU that shares a portion of the processing. Additional
mass storage devices (not shown) may also be connected
to CPU 702 through network interface 724.

Auxiliary I/O device interface 726 represents general and
customized interfaces that allow the CPU 702 to send and,
more typically, receive data from other devices such as
microphones, touch-sensitive displays, transducer card
readers, tape readers, voice or handwriting recognizers,
biometrics readers, cameras, portable mass storage devices,
and other computers.

Also coupled to the CPU 702 is a keyboard controller 732
via a local bus 734 for receiving input from a keyboard 736
or a pointer device 738, and sending decoded symbols from
the keyboard 736 or pointer device 738 to the CPU 702. The
pointer device may be a mouse, stylus, track ball, or tablet,
and is useful for interacting with a graphical user interface.

In addition, embodiments of the present invention further
relate to computer storage products with a computer-readable
medium that contains program code for performing
various computer-implemented operations. The computer- 
readable medium is any data storage device that can store 
data which can thereafter be read by a computer system. The 
media and program code may be those specially designed 
and constructed for the purposes of the present invention, or 
they may be of the kind well known to those of ordinary skill 
in the computer software arts. Examples of computer- 
readable media include, but are not limited to, all the media 
mentioned above: magnetic media such as hard disks, floppy 
disks, and magnetic tape; optical media such as CD-ROM 
disks; magneto-optical media such as floptical disks; and 
specially configured hardware devices such as application- 
specific integrated circuits (ASICs), programmable logic 
devices (PLDs), and ROM and RAM devices. The 
computer-readable medium can also be distributed as a data 
signal embodied in a carrier wave over a network of coupled 
computer systems so that the computer-readable code is 
stored and executed in a distributed fashion. Examples of 
program code include both machine code, as produced, for
example, by a compiler, and files containing higher level code
that may be executed using an interpreter. 

It will be appreciated by those skilled in the art that the 
above described hardware and software elements are of 
standard design and construction. Other computer systems 
suitable for use with the invention may include additional or 
fewer subsystems. In addition, memory bus 708, peripheral 
bus 714, and local bus 734 are illustrative of any intercon- 
nection scheme serving to link the subsystems. For example, 
a local bus could be used to connect the CPU to fixed mass 
storage 716 and display adapter 720. The computer system 
shown in FIG. 7 is but an example of a computer system 
suitable for use with the invention. Other computer archi- 
tectures having different configurations of subsystems may 
also be utilized. 

Although the foregoing invention has been described in 
some detail for purposes of clarity of understanding, it will 
be apparent that certain changes and modifications may be 
practiced within the scope of the appended claims. 
Furthermore, it should be noted that there are alternative 
ways of implementing both the process and apparatus of the 
present invention. For example, the hash function can be 
applied to variable length substrings instead of fixed length 
substrings. In another example, data structures other than a 
hash table, such as a neural network, can be used to 
implement the index file. In another example, methods other 
than the union/find algorithm can be used to cluster docu- 
ments. In yet another example, a binary tree or table can be 
used in place of a B-tree for storing document name and 
range information. In addition, although the present inven- 
tion has been described in the context of detecting plagia- 
rism (copying) among a set of documents, it has many other 
applications. For example, it can be used in the legal field for 
litigation support, intellectual property security, checking 
for document updates, providing automatic version history, 
providing copyright protection on the Internet, merging 
redundant program code segments, and software clone 
detection. The program can also be used as a supplement to 
or as a component in other computer-based applications 
such as search engines, database systems, document man- 
agement systems, file systems, and information retrieval. 
Accordingly, the present embodiments are to be considered 
as illustrative and not restrictive, and the invention is not to 
be limited to the details given herein, but may be modified 
within the scope and equivalents of the appended claims. 
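
For reference, the union/find clustering mentioned above can be sketched in a few lines. This is an illustrative assumption rather than the disclosed implementation: the function name `cluster` and the input format (a list of pairs of document names already found to match) are chosen only for this example.

```python
def cluster(pairs):
    """Group document names into clusters, given pairs of matching
    documents, using a union/find (disjoint-set) structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # union the two clusters

    clusters = {}
    for doc in parent:
        clusters.setdefault(find(doc), set()).add(doc)
    return sorted(sorted(c) for c in clusters.values())
```

Two documents end up in the same cluster whenever a chain of pairwise matches connects them, even if the two were never compared directly.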

What is claimed is: 

1. A computer readable medium containing programmed
instructions for simultaneously digitally shredding two or
more second strings that match a first string, the programmed
instructions comprising:
a computer code for comparing the first string with a
plurality of second strings;
a computer code for compiling a list of second string
names from the plurality of second strings wherein
each second string corresponding to a name from the
list of second string names matches the first string to a
degree higher than a predetermined threshold degree of
similarity; and
a computer code for electronically shredding one or more
second string names from the list thereby eliminating
copies and remnants of the first string from the data
processing system.

2. In a data processing system, a method of simultaneously
digitally shredding two or more second strings that
match a first string, the method comprising:
(a) comparing the first string with a plurality of second
strings;
(b) compiling a list of second string names from the
plurality of second strings wherein each second string
corresponding to a name from the list of second string
names matches the first string to a degree higher than
a predetermined threshold degree of similarity; and
(c) electronically shredding one or more second string
names from the list thereby eliminating copies and
remnants of the first string from the data processing
system.

3. A method as recited in claim 2 wherein comparing the
first string with a plurality of second strings further includes:
(a) identifying a plurality of substrings common to the
first string and a second string from the plurality of
second strings;
(b) identifying at least a subset of said plurality of
substrings which occur in the same relative positions in
the first and second strings; and
(c) storing as a group, at least temporarily, those substrings
which occur in the same relative positions in the
first and second strings.

4. A method as recited in claim 3 wherein identifying a
plurality of substrings common to the first and second
strings further includes:
(i) dividing the first string into substrings and hashing
those substrings to provide a first collection of hashes;
(ii) dividing the second string into substrings and hashing
those substrings to provide a second collection of
hashes and comparing hashes of the second collection
with the first collection of hashes; and
(iii) identifying those hashes in the first and second
collections of hashes that match.

5. A method as recited in claim 3 wherein identifying at
least a subset of said plurality of substrings further includes:
(i) comparing the relative positions within the first and
second strings of all matched pairs of substrings common
to the first and second strings;
(ii) identifying a first matched pair and second matched
pair having substrings contiguous in both strings or
possessing a same degree of overlap in both strings;
and
(iii) grouping the first and second matched pairs.

6. A method as recited in claim 2 wherein compiling a list
of second string names further includes:
determining a subset of the plurality of second strings
wherein each second string in the subset corresponds to
one or more groups of substrings which occur in the
same relative positions in the first and second strings.

7. A method as recited in claim 2 wherein comparing the
first string with a plurality of second strings further comprises:
[illegible]
searching a storage area storing a plurality of ordered file
substrings for the first string substring;
storing match data relating to a match between the first
string substring and a first ordered file substring; and
joining the first ordered file substring and a second
ordered file substring if the first ordered file substring
and the second ordered file substring are in a particular
sequence and if the first string substring and a first
string second substring are in the same particular
sequence wherein the second ordered file substring and
the first string second substring match, thereby forming
a third coalesced ordered file substring and a first string
third substring that is coalesced.

8. In a data processing system, a method of comparing a
first string and a second string, the method comprising:
(a) identifying a plurality of substrings common to the
first and second strings;
(b) identifying at least a subset of said plurality of
substrings which occur in the same relative positions in
the first and second strings; and
(c) storing as a group or displaying as a group, at least
temporarily, those substrings which occur in the same
relative positions in the first and second strings.

9. The method of claim 8, wherein the first and second
strings are computer documents containing ASCII characters.

10. The method of claim 8, wherein identifying the
plurality of common substrings comprises:
(i) dividing the first string into substrings and hashing
those to provide a first collection of hashes;
(ii) dividing the second string into substrings and hashing
those substrings to provide a second collection of
hashes and comparing hashes of the second collection
with the first collection of hashes; and
(iii) identifying those hashes in the first and second
collections of hashes that match.

11. The method of claim 8, wherein identifying at least a
subset of said plurality of substrings comprises:
(i) comparing the relative positions within the first and
second strings of all matched pairs of substrings common
to the first and second strings;
(ii) identifying a first matched pair and second matched
pair having substrings that are contiguous in both strings or
possess a same degree of overlap in both strings; and
(iii) grouping the first and second matched pairs.

12. The method of claim 8, wherein storing or displaying
substrings as a group comprises displaying contiguous
collections of substrings common to the first and second strings.

13. The method of claim 8, wherein identifying at least a
subset of said plurality of substrings comprises identifying
substrings which occur in the same relative positions in a
third string as well as the first and second strings.

14. A method of segmenting a file as part of a file
matching operation, a file representable by a string of
characters, the method comprising:
creating a plurality of segments from the string of
characters, each one of the segments the plurality of
segments having a predetermined length and a beginning
position;
maintaining a predetermined offset between the beginning
position of each consecutive one of the plurality of
segments, wherein the predetermined offset is at least
two character positions in length; and
executing a file matching operation using the plurality of
segments and predetermined offset whereby the file
matching operation will detect a similar passage
between three or more files where the passage has a
length of at least the sum of the predetermined length
and the predetermined offset.

15. A method as recited in claim 14 wherein the file
matching operation stores in a segment storage area a
segment from the plurality of segments that has a beginning
position at a position in the string of characters that is a
multiple of the predetermined offset.

16. A method as recited in claim 14 wherein the file
matching operation compares every segment of the predetermined
length in the string of characters against a plurality
of loaded segments in a segment storage area.

17. A method as recited in claim 14 wherein the file
matching operation stores in a segment storage area a segment
from the plurality of segments if the segment has a
beginning position at a position in the string of characters
that is an even multiple of a predetermined value and if the
number of characters from the beginning position of a last
stored segment is at least the length of the predetermined
offset.

18. A method of comparing a query file to two or more
stored files, the method comprising:
receiving a query file having a plurality of query file
substrings;
selecting a first query file substring from the plurality of
query file substrings, wherein an offset between two
consecutive query file substrings is at least two character
positions in length;
searching a storage area storing a plurality of ordered file
substrings for the first query file substring;
storing match data relating to a match between the first
query file substring and a first ordered file substring;
and
joining the first ordered file substring and a second
ordered file substring if the first ordered file substring
and the second ordered file substring are in a particular
sequence and if the first query file substring and a
second query file substring are in the same particular
sequence wherein the second ordered file substring and
the second query file substring match, thereby forming
a third coalesced ordered file substring and a third
coalesced query file substring that can be used to
format output comparison data.

19. A method as recited in claim 18 further comprising
preprocessing the first query file substring thereby making
the substring more suitable for searching in the storage area.

20. A method as recited in claim 18 further comprising
deriving an identifier corresponding to the first query substring
using a predetermined function and using the identifier
to perform searches in the storage area and identify
matches between the plurality of query file substrings and
the plurality of ordered file substrings.

21. A method as recited in claim 18 further comprising
determining whether the query file can be integrated with
one or more groups of stored files by comparing the query
file with a stored file from each of the one or more groups
of stored files.

22. A method as recited in claim 21 further comprising
qualifying a query file for integration with one or more
groups of stored files by examining the number of matches
between the plurality of query file substrings and ordered file
substrings from a particular stored file.

23. A method as recited in claim 18 wherein the match
data includes a plurality of query file substring positions
paired with a plurality of corresponding ordered file substring
positions, the corresponding ordered file substrings
arranged in segments corresponding to stored files.

24. A method as recited in claim 18 wherein joining the
first ordered file substring and the second ordered file
substring further comprises eliminating overlaps between
two ordered file substrings.

25. A method as recited in claim 24 further comprising
segmenting the two ordered file substrings into three sub-segments
including a first sub-segment formed from a first
of the two ordered file substrings, a second sub-segment
formed from an overlap between the two ordered file
substrings, and a third sub-segment formed from a second of
the two ordered file substrings.

26. A method as recited in claim 18 further comprising
identifying a longest length match between a plurality of
third coalesced ordered file substrings and a plurality of third
coalesced query file substrings and removing third coalesced
indexed file substrings and third coalesced query file substrings
corresponding to the longest length match, whereby
duplicate query file substrings and ordered file substrings do
not affect output comparison data.

27. A method as recited in claim 26 further comprising
repeating the identification and removal of the longest length
match between the plurality of third coalesced ordered file
substrings and the plurality of third coalesced query file
substrings.

28. A method as recited in claim 27 further comprising
assigning the longest length match a unique name thereby
transforming the plurality of query file substrings into a
simplified query file string and the plurality of ordered file
substrings into a simplified file string, wherein the simplified
query file string and the simplified file string include a
plurality of unique names.

29. A method as recited in claim 28 further comprising
assigning an indicator to each one of the plurality of unique
names for display as output comparison data associated with
the query file and one or more of the stored files.

* * * * *